Feature selection analysis

Companion analysis to the paper …

R. Torkar and R. Berntsson Svensson

2020-01-20

Introduction

First we prepare the data, check for NAs, and look at some descriptive statistics. Since the data comes from Excel we should be very careful when loading it, verifying that nothing goes wrong (data manipulated in an arbitrary way, data types changed, etc.).

d <- read.xlsx("Features.xlsx", sheet = "Features")

This is what the data looks like. In short, 11110 rows and 10 variables.

str(d)
## 'data.frame':    11110 obs. of  10 variables:
##  $ ID                    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ State                 : chr  "Elicited, Prio, Dropped" "Elicited, Prio, Dropped" "Elicited, Prio, Planned, Implemented, Tested, Released" "Elicited, Prio, Dropped" ...
##  $ Team.priority         : num  88 83 957 79 88 76 74 73 72 71 ...
##  $ Critical.feature      : chr  "No" "No" "No" "No" ...
##  $ Business.value        : chr  "No value" "No value" "No value" "No value" ...
##  $ Customer.value        : chr  "No value" "No value" "No value" "No value" ...
##  $ Stakeholders          : num  1 1 0 0 0 1 1 1 1 1 ...
##  $ Key.customers         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Dependency            : chr  "No" "No" "No" "No" ...
##  $ Architects.involvement: chr  "None" "None" "None" "None" ...

Each row has a unique ID. State, which is our outcome (dependent variable), shows how far the feature got in the process. It is of an ordered categorical type, which should be modeled with a cumulative likelihood. There are six categories, ranging from ‘Elicited, Dropped’ to ‘Elicited, Prio, Planned, Implemented, Tested, Released’.

Team.priority is the relative priority the feature got, \(\mathbb{N} = \{0,\ldots,1000\}\). Critical.feature is a simple ‘Yes’/‘No’ answer (\(\mathbb{Z}_2\)). Business.value and Customer.value are also ordered categorical, with three levels and a fourth level called ‘No value’.

Stakeholders takes integer values, i.e., \(\mathbb{N} = \{0,\ldots,10\}\), and Key.customers the same, but with a different set, i.e., \(\mathbb{N} = \{0,\ldots,60\}\).

Finally, Dependency is \(\mathbb{Z}_2\), while Architects.involvement is ordered categorical (‘None’ < ‘Simple’ < ‘Monitoring’ < ‘Active Participation’ < ‘Joint Design’).

All ordered categorical predictors (independent variables) should be modeled as monotonic effects [1].

[1] P.-C. Bürkner and E. Charpentier, “Modelling monotonic effects of ordinal predictors in Bayesian regression models,” British Journal of Mathematical and Statistical Psychology, doi: 10.1111/bmsp.12195.
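The idea, roughly: for an ordinal predictor observed at level \(x_i\), the contribution to the linear predictor is an overall effect \(b\) scaled by the cumulative fraction of a simplex \(\zeta\) (brms may apply an additional internal scaling constant),

\[ \eta_i = \cdots + b \sum_{k=1}^{x_i} \zeta_k, \qquad \zeta \sim \text{Dirichlet}(a), \]

which guarantees that the effect changes monotonically across categories, while the Dirichlet prior on \(\zeta\) controls how evenly the change is spread over the levels.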

table(is.na(d))
## 
##  FALSE 
## 111100

No NAs in the dataset. However, that does not rule out NAs: some of the coding could be a representation of NA, e.g., ‘No value’. In this particular case we know that ‘No value’ and ‘None’ in the dataset actually are values and not representations of NA.

Finally, we should set correct data types on all predictors.

# ordered categorical
d$State <- factor(d$State, 
                  levels = c("Elicited, Dropped", 
                             "Elicited, Prio, Dropped", 
                             "Elicited, Prio, Planned, Dropped", 
                             "Elicited, Prio, Planned, Implemented, Dropped", 
                             "Elicited, Prio, Planned, Implemented, Tested, Dropped", 
                             "Elicited, Prio, Planned, Implemented, Tested, Released"), 
                  ordered = TRUE)

d$Business.value <- factor(d$Business.value, 
                           levels = c("No value",
                                      "Valuable",
                                      "Important",
                                      "Critical"), 
                           ordered = TRUE)

d$Customer.value <- factor(d$Customer.value, 
                           levels = c("No value",
                                      "Valuable",
                                      "Important",
                                      "Critical"), 
                           ordered = TRUE)

d$Architects.involvement <- factor(d$Architects.involvement,
                                   levels = c("None",
                                              "Simple",
                                              "Monitoring",
                                              "Active Participation",
                                              "Joint Design"), 
                                   ordered = TRUE)

# binary
d$Critical.feature <- ifelse(d$Critical.feature == 'Yes', 1, 0)
d$Dependency <- ifelse(d$Dependency == 'Yes', 1, 0)

Descriptive statistics

We now have a data frame, d, which has all predictors’ types correctly set.

str(d)
## 'data.frame':    11110 obs. of  10 variables:
##  $ ID                    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ State                 : Ord.factor w/ 6 levels "Elicited, Dropped"<..: 2 2 6 2 1 2 2 2 2 2 ...
##  $ Team.priority         : num  88 83 957 79 88 76 74 73 72 71 ...
##  $ Critical.feature      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Business.value        : Ord.factor w/ 4 levels "No value"<"Valuable"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Customer.value        : Ord.factor w/ 4 levels "No value"<"Valuable"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Stakeholders          : num  1 1 0 0 0 1 1 1 1 1 ...
##  $ Key.customers         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Dependency            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Architects.involvement: Ord.factor w/ 5 levels "None"<"Simple"<..: 1 1 1 1 1 1 1 1 1 1 ...

Let’s plot our outcome and predictors so we get a feeling for the distributions. In the margin you will find them all plotted.

We see that for State approximately as many features are released (the final stage) as are dropped in the first state. We also see that the counts drop off after the initial state.

For Team.priority many features have zero in priority (\(5139\)), and then there’s a bunch of them (\(1516\)) that have priority set to the maximum value, i.e., \(1000\).

For Critical.feature we have an emphasis on ‘No’.

Concerning Business.value and Customer.value they are very similar in their respective distribution (as one would expect).

For Stakeholder and Key.customers we see an emphasis on lower numbers, while for Dependency an emphasis on ‘No’.

Finally, for Architects.involvement we see that in the absolute majority of the cases architects are not involved.

In short, it looks sort of like what one would expect, i.e., it’s not hard to find answers to why the plots look the way they do.

However, before we continue, we should standardize some of our predictors so the sampling will be easier, i.e., we simply compute \((x - \bar{x})/\sigma_x\); multiplying by \(\sigma_x\) and adding the mean then allows us to get back to our untransformed scale. It’s good practice to store this in new variables suffixed with _s. At the same time, let’s give our variables abbreviated names.

d$prio_s <- scale(d$Team.priority)
d$sh_s <- scale(d$Stakeholders)
d$kc_s <- scale(d$Key.customers)

d$crit <- d$Critical.feature
d$b_val <- d$Business.value
d$c_val <- d$Customer.value
d$dep <- d$Dependency
d$arch <- d$Architects.involvement
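As a side note, `scale()` stores the constants it used as attributes, so the back-transformation never has to be done by hand. A minimal sketch with a toy vector (not the actual data):

```r
x <- c(0, 250, 1000)                  # toy priorities, not the real data
x_s <- scale(x)                       # (x - mean(x)) / sd(x)
# back-transform: multiply by sigma_x and add the mean
x_back <- as.numeric(x_s) * attr(x_s, "scaled:scale") +
  attr(x_s, "scaled:center")
all.equal(x_back, x)                  # TRUE
```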

Initial model comparison

We compare two models (with default priors): one grand-intercept null model, \(\mathcal{M}_0\), and one with all predictors added, \(\mathcal{M}\), to see if adding predictors makes a difference (if not, then we get depressed and conclude this matter for this time).

First sample the two models with default priors.

M_0 <- brm(State ~ 1, family = cumulative, data = d, refresh = 0)

M <- brm(State ~ prio_s + crit + mo(b_val) + mo(c_val) + 
                 sh_s + kc_s + dep + mo(arch),
               family = cumulative,
               data = d, refresh = 0)

Add the LOO fit criterion to the models and then compare their respective out-of-sample predictions.

M_0 <- add_criterion(M_0, criterion = "loo")
M <- add_criterion(M, criterion = "loo")
(l <- loo_compare(M_0, M))
##     elpd_diff se_diff
## M       0.0       0.0
## M_0 -3170.4      62.0

LOO clearly puts \(\mathcal{M}\) as no. 1. If we assume a \(z_{\text{99%}}\)-score of \(2.576\) it’s clear that zero is not in the interval and that \(\mathcal{M}\) is better, i.e., \(\text{CI}_{z_{99\%}}\) [-3330.16, -3010.54]. We can conclude that adding predictors to \(\mathcal{M}\) has a significant effect.
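The interval above is just the normal approximation to the elpd difference (small discrepancies against the quoted interval are due to rounding of the reported elpd_diff and se_diff):

\[ \text{elpd\_diff} \pm z_{99\%} \times \text{se\_diff} = -3170.4 \pm 2.576 \times 62.0 \approx [-3330.1, -3010.7]. \]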

Prior predictive checks

Before we start to use our data to do inferences we should think about our priors. Our outcome of interest is State, which is ordered categorical data and as such we will model it using a cumulative likelihood.

First let’s see what priors we would get with a full model where we include all other variables as predictors. We’ll model b_val, c_val, and arch as monotonic predictors since they are, precisely like our outcome, of an ordered categorical nature.

First, we set weakly regularizing priors on \(\beta\) and \(\alpha\), and do the same for our monotonic predictors. Then, after looking at different levels of priors, we concluded that the following priors were suitable for this model.

p <- get_prior(State ~ prio_s + crit + mo(b_val) + mo(c_val) + 
                 sh_s + kc_s + dep + mo(arch),
               family = cumulative,
               data = d)

p$prior[1] <- "normal(0, 0.5)"
p$prior[10] <- "normal(0, 2)"
# Dirichlet dimension is K - 1, where K is the number of categories
p$prior[16] <- "dirichlet(2, 2, 2, 2)" 
p$prior[17:18] <- "dirichlet(2, 2, 2)"
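Indexing rows of the `get_prior()` data frame by position is fragile, since the row order can change between brms versions. A sketch of an equivalent, arguably more robust, specification using `set_prior()`; the `simo` coefficient names are the ones we assume brms generates for the monotonic terms:

```r
# combine named priors instead of overwriting rows by index
p <- c(set_prior("normal(0, 0.5)", class = "b"),
       set_prior("normal(0, 2)", class = "Intercept"),
       set_prior("dirichlet(2, 2, 2)", class = "simo", coef = "mob_val1"),
       set_prior("dirichlet(2, 2, 2)", class = "simo", coef = "moc_val1"),
       set_prior("dirichlet(2, 2, 2, 2)", class = "simo", coef = "moarch1"))
```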

Using math notation, the above looks like,

\[ \begin{eqnarray*} \text{State}_i & \sim & \text{Ordered-logit}(\phi_i,\kappa) \\ \phi_i & = & \beta_{\text{prio}} \times \text{Priority}_i + \beta_{\text{crit}} \times \text{Critical Feature}_i\\ & + & \beta_{\text{b}} \times \text{Business value}_i + \beta_{\text{c}} \times \text{Customer value}_i\\ & + & \beta_{\text{sh}} \times \text{Stakeholders}_i + \beta_{\text{kc}} \times \text{Key customers}_i\\ & + & \beta_{\text{dep}} \times \text{Dependency}_i + \beta_{\text{a}} \times \text{Architects involvement}_i\\ \beta_{\text{prio}}, \beta_{\text{crit}}, \beta_{\text{sh}}, \beta_{\text{kc}}, \beta_{\text{dep}} & \sim & \text{Normal}(0, 0.5)\\ \beta_{\text{b}}, \beta_{\text{c}} & \sim & \text{Dirichlet}(2,2,2) \\ \beta_{\text{a}} & \sim & \text{Dirichlet}(2,2,2,2) \\ \kappa & \sim & \text{Normal}(0, 2) \end{eqnarray*} \]

Let’s sample from the priors only (it’s enough to use one chain here), and then check against the data to see that they are wide enough.

M_prior <- brm(State ~ prio_s + crit + mo(b_val) + mo(c_val) + 
                 sh_s + kc_s + dep + mo(arch),
               family = cumulative,
               data = d, prior = p, sample_prior = "only", 
               chains = 1, control = list(adapt_delta = 0.95), 
               refresh = 0)

pp_check(M_prior, type = "bars", nsamples = NULL)

If we sample from our prior predictive distribution and compare against our empirical data we see how the priors cover the data. As is evident from the intervals, our priors seem to be relaxed and should not influence the outcome.

Model inference

Let’s now sample our model and also include sampling of priors in case we need it later.

M <- brm(State ~ prio_s + crit + mo(b_val) + mo(c_val) + 
           sh_s + kc_s + dep + mo(arch),
         family = cumulative,
         data = d, prior = p, sample_prior = "yes", 
         chains = 4, refresh = 0)

Posterior predictive check

Plotting our model’s predictions, \(y_{\text{rep}}\), vs. our empirical values, \(y\), we see that the model does a very good job of estimating the categories. The credible intervals are very tight, and if we want to say anything negative about the plot to the right it is that the first and third categories seem to be underestimated and overestimated, respectively, by the model.

If we look at our previous plot, where we used only priors, we can clearly see, when we compare with the plot to the right, that our data has completely swamped the priors.

Diagnostics

Our caterpillar plots look good.

Our \(\widehat{R}\) and effective sample size (neff) look good.

# should be < 1.01
max(rhat(M))
## [1] 1.000998
# should be > 0.1
min(neff_ratio(M))
## [1] 0.4528161

There is no perfect model, but we could claim that this is a good model.

Estimates

If we now focus on our population-level estimates we see that our ‘Intercepts’, which in this case are the points between two categories, are very precisely estimated. However, to make sense of them we need to transform the estimates, since the values are on the \(\operatorname{logit}\) scale.

Our population-level estimates (fixed effects) from our model \(\mathcal{M}\). Note that the values are on \(\operatorname{logit}\) scale.

                 Estimate  Est.Error   Q2.5  Q97.5
Intercept[1]        -1.40       0.03  -1.46  -1.34
Intercept[2]        -0.16       0.03  -0.21  -0.11
Intercept[3]         1.10       0.03   1.05   1.16
Intercept[4]         1.50       0.03   1.44   1.56
Intercept[5]         1.56       0.03   1.50   1.62
prio_s               1.50       0.02   1.45   1.54
crit                 0.83       0.06   0.71   0.95
sh_s                -0.13       0.02  -0.17  -0.09
kc_s                 0.02       0.02  -0.02   0.06
dep                  0.07       0.05  -0.04   0.17
mob_val              0.13       0.02   0.08   0.18
moc_val              0.04       0.03  -0.02   0.10
moarch               0.07       0.03   0.01   0.14

Intercepts

Let’s take the first parameter, Intercept[1], which was estimated to \(-1.40\), i.e.,

inv_logit_scaled(-1.40)
## [1] 0.1978161

What does 0.2 mean? Well, we have a categorical outcome, so 20% of the probability mass, with a 95% credible interval of [0.19, 0.21], was assigned to the first category, Elicited, Dropped. For Intercept[2], a cumulative 0.46 [0.45, 0.47] was assigned to the first two categories combined, i.e., Elicited, Dropped and Elicited, Prio, Dropped.
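To see all six category probabilities at once we can difference the inverse-logit intercepts. A sketch using the posterior means from the table above and base R’s `plogis()` as the inverse logit:

```r
kappa <- c(-1.40, -0.16, 1.10, 1.50, 1.56)  # Intercept[1..5] posterior means
cum_p <- plogis(kappa)                       # cumulative probabilities
round(cum_p, 2)                              # 0.20 0.46 0.75 0.82 0.83
# probability mass in each of the six categories:
round(diff(c(0, cum_p, 1)), 2)               # 0.20 0.26 0.29 0.07 0.01 0.17
```

Note that the per-category masses sum to one, as they must.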

\(\beta\) estimates

Let’s turn our attention to the other estimates. The prio_s estimate, 1.5, would then become 0.82 when transformed. But remember the suffix _s! We need to multiply with \(\sigma_x\) and add \(\bar{x}\) from the data, which leads to an estimate of 708.49 [705.5, 711.46].

However, just looking at a point estimate is, quite frankly, not terribly useful. Let’s plot the posterior probability densities for our population-level estimates (disregarding Intercept\([1,\ldots,5]\) and our monotonic predictors).

Examining the above plot we can say that, on the 95% level, the first two parameters are clearly positive and do not cross \(0\). The third parameter, Stakeholders, is clearly negative. The last two parameters are not significant, but they seem to be slightly positive, e.g., Key customers and Dependency have the following 95% credible intervals on the \(\operatorname{logit}\) scale: [-0.02, 0.06] and [-0.04, 0.17], respectively. So, traditionally, using the arbitrary 95% threshold, they are considered not significant.

To conclude what we’ve noticed so far: Team priority, Critical feature, and Stakeholders are significant on \(\text{CI}_{95\%}\). Let’s now turn our attention to our three monotonic predictors.

Estimates of monotonic predictors

Three monotonic predictors, Business value, Customer value, and Architects involvement were estimated by our model.

Our three monotonic predictors.

It’s clear by simply looking at the plot that Business value and Architects involvement are significant and positive. What’s interesting is that Customer value, which is positive, is not significant. One would think that business value and customer value should have, approximately, the same effect, but our model claims that this is not the case. Could it be that it’s harder for experts to estimate customer value, compared to business value? The whole requirements engineering community just felt a disturbance in the force 😂

Conditional effects

Below we plot all conditional effects for our model. The colors represent the different categories, \(1,\ldots,6\) in our outcome State:

  1. Elicited, Dropped
  2. Elicited, Prio, Dropped
  3. Elicited, Prio, Planned, Dropped
  4. Elicited, Prio, Planned, Implemented, Dropped
  5. Elicited, Prio, Planned, Implemented, Tested, Dropped
  6. Elicited, Prio, Planned, Implemented, Tested, Released